8/15/2017

Introduction

  1. The Case
  2. Learning Goals
  3. The Data Science Process
  4. The Tidyverse

The Case

  • AirBnB is a website that allows hosts to rent their homes to travelers and tourists.
  • AirBnB can maximize its business by ensuring it has enough listings in different neighborhoods.
  • AirBnB has asked you, a data analytics consultant, to help them identify neighborhoods in Boston where they should focus on recruitment through advertising and incentives.
  • They have given you a complex, multi-part data set to answer this question.

What We'll Do This Morning

  • Work with real and interesting data
  • Create appealing and informative visualizations
  • Draw useful conclusions

David Robinson

Exercise 0

  1. Look left.
  2. Look right.
  3. Pick a partner.
  4. Give them a professional, yet friendly smile. You are going to need them soon.

FYI: Base R

  • R without any packages offers ways to do most of the things we will see today.
  • But base R is not a very good programming language.
  • So, we will use….

The Tidyverse

  • "Programs must be written for people to read and only incidentally for machines to execute" - Hal Abelson

The Tidyverse is a set of packages that promote code which is:

  • Easy to read and write
  • Highly performant
  • Consistent across the data science workflow

When you master the Tidyverse, you spend more time thinking about your problem and less time thinking about your code.

The Process of Data Science

Getting Started

  1. Data Import and Inspection
  2. Data Subsetting
  3. The Pipe

Our Data Today

AirBnB listings, schedule, and review text for the Boston area, for a time period…that you will check.

library(tidyverse)
listings <- read_csv('../data/listings.csv')
calendar <- read_csv('../data/calendar.csv')

Warmup: The Nicest Places in JP

  • Question: What are the "nicest" places to stay in Jamaica Plain?
  • Task: Construct a table of listings in JP, sorted in descending order by rating.

Conceptually, we need to: filter to only JP listings, arrange the listings by rating, and select only the columns we want to see. Let's write some code!

The Pipe

Stuff on the pipe here

Exercise 1:

Working with your partner, please rewrite the JP code using the pipe operator. Here's the first line to get you started:

listings %>% 
    filter(neighbourhood == 'Jamaica Plain')
# ...

Exercise 1 Sample Code

listings %>% 
    filter(neighbourhood == 'Jamaica Plain') %>%      # filter needs a logical test
    arrange(desc(review_scores_rating)) %>%           # desc() makes descending order 
    select(neighbourhood, name, review_scores_rating)

Exercise 1 Sample Output

Exercise 2: The Biggest Place in Back Bay

You are going to spend a long weekend in Back Bay with 50 of your closest friends.

Working with your partner, modify your code slightly to construct a table of the listings in Back Bay, sorted by the number of people who can stay there. You may need to use glimpse to see which columns you'll want to use.

Exercise 2: Sample Code

listings %>% 
    filter(neighbourhood == 'Back Bay') %>% 
    arrange(accommodates) %>% 
    select(neighbourhood, name, accommodates, price)

Exercise 2: Sample Output

Summarising Data

  1. Summary Statistics
  2. Adding Columns
  3. Grouping

Exercise 3

Modify the summary table as follows:

  1. Group by property_type as well as neighborhood. How does this impact the output?
  2. Add a new column to the table with the average rate per person for WEEKLY rentals, using the weekly_price column.
  3. Add a new column giving the total "capacity" for each neighborhood, given as the total number of beds in rentals in that neighborhood.

Exercise 3 Sample Code

summary_table <- listings %>% 
    mutate(price = price %>% gsub('\\$|', '',.) %>% as.numeric(),
           price_per = price / accommodates,
           weekly_price = weekly_price %>% gsub('\\$|', '',.) %>% as.numeric(),
           weekly_price_per = weekly_price / accommodates)  %>% 
    group_by(neighbourhood, property_type) %>% 
    summarize(n = n(), 
              mean_rating = mean(review_scores_rating, na.rm = TRUE),
              price_per = mean(price_per, na.rm = TRUE),
              weekly_price_per = mean(weekly_price_per, na.rm = T),
              capacity = sum(beds)) 

Exercise 3 Sample Output

Exercise 4

If you check summary_table, you'll notice that it has exactly one grouping column: neighbourhood. Working with your partner,

  1. Explain why there is just one grouping column when we made two groups in Exercise 3.
  2. Use mutate to construct a rank column where the top-ranked row has the highest value of n (i.e. most popular type). You might want the min_rank function and the desc function we used when sorting. What behavior do you observe?
  3. Filter to include only rows of rank 3 or less.
  4. Sort the rows in descending order by n using arrange. What behavior do you observe now? See if you can get the table sorted in descending order by rank, while keeping neighbourhoods grouped together.

Exercise 4 Sample Code

ranked_summary_table <- summary_table %>% 
    mutate(rank = min_rank(desc(n))) %>% 
    filter(rank <= 3) %>% 
    arrange(neighbourhood, rank)

Exercise 4 Sample Output

Keeping Current

  1. More practice with filter and summarise
  2. joining data

How Current Is This Info?

calendar %>% 
    summarise(earliest = min(date), 
           latest = max(date))
## # A tibble: 1 x 2
##     earliest     latest
##       <date>     <date>
## 1 2016-09-06 2017-09-05

Pretty current. But what if we want to focus on listings that have been active in the last three months?

Exercise

Construct a table from the calendar data giving the listings that had a valid listed date between June 1st, 2016 and today. You determine what "valid" means in this context.

Hint: You can represent a date using the function

lubridate::mdy('6/1/2016')
## [1] "2016-06-01"

You can also use lubridate::today(). You can use max() to get the most recent date.

Sample Solution: Code

current_table <- calendar %>% 
    filter(!is.na(price), 
           date < lubridate::today(),
           date > lubridate::mdy('6/1/2016')) %>%
    group_by(listing_id) %>% 
    summarise(last_active = max(date))

Sample Solution: Output

Relational Data

The information we need is distributed between two tables – how can we get there?

We need a key that tells us which calendar rows correspond to which listings.

listings$id corresponds to calendar_listing$id

join

The join family of functions lets us add columns from one table to another using a key.

  • x %>% left_join(y) : most common, keeps all rows of x but not necessarily y.
  • x %>% right_join(y) : keeps all rows of y but not necessarily x.
  • x %>% outer_join(y) : keeps all rows of both x and y
  • x %>% full_join(y) : keeps only rows of x that match in y and vice versa.

We'll use left_join for this case.

Getting Visual

  1. Graphical Excellence
  2. The Grammar of Graphics
  3. Wrangling and Visualization

Graphical Excellence

Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design. Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency. – Edward Tufte

The Grammar of Graphics

A grammar is a set of components (ingredients) that you can combine to create new things. Many grammars have required components: if you're missing one, you're doing it wrong. In baking….

  • A body – typically some kind of flour)
  • Binder – eggs, oil, butter, or applesauce
  • A rising agent – yeast, baking soda, baking powder
  • Flavoring – sugar, salt, chocolate, vanilla

The Grammar of Graphics

  • Puts the gg in ggplot2.
  • Formulated by Leland Wilkinson.
  • Implemented in code by Hadley Wickham, now part of the tidyverse

Ingredients of a data visualization

  • Data: almost always a data_frame
  • Aesthetic mapping: relation of data to chart components.
  • Geometry: specific visualization type? E.g. line, bar, heatmap?
  • Statistical transformation: how should the data be transformed or aggregated before visualizing?
  • Theme: how should the non-data parts of the plot look?
  • Misc. other options.

Data, aesthetics, and geometries are the required grammatical components that you always need to specify.

First Plot

Does getting lots of reviews usually mean you get good reviews?

listings %>% 
    ggplot()

First Plot

listings %>% 
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating)

First Plot

listings %>% 
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point()

First Plot

listings %>% 
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2) 

First Plot

listings %>% 
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2) + 
    theme_bw()

First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2) + 
    theme_bw() 

First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2) + 
    theme_bw() + 
    labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality') 

First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2, color = 'firebrick') + 
    theme_bw() + 
    labs(x='Number of Reviews', y='Review Score',title='Review Volume and Review Quality') 

Changing It Up

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = review_scores_value, 
        y = review_scores_location, 
        size = number_of_reviews) + 
    geom_point(alpha = .2, color = 'firebrick') + 
    theme_bw() 

As a Heatmap

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = review_scores_value, 
        y = review_scores_location, 
        fill = number_of_reviews) + 
    geom_tile() + 
    theme_bw() 

Exercise 6

The following code computes the average price of all listings on each day in the data set:

average_price_table <- calendar %>% 
    mutate(price = price %>% gsub('\\$|', '',.) %>% as.numeric()) %>% 
    group_by(date) %>% 
    summarise(mean_price = mean(price, na.rm = TRUE))

Use geom_line() to visualize these prices with time on the x-axis and price on the y-axis.

Exercise 6 Sample Solution

 average_price_table %>% 
    ggplot() + 
    aes(x = date, y = mean_price) + 
    geom_line()

Exercise 7

Using the summary_table object you created earlier, make a bar chart showing the number of apartments by neighbourhood. In this case, the correct geom to use is geom_bar(stat = 'identity').

Exercise 7 Sample Solution

summary_table %>% 
    filter(property_type == 'Apartment') %>% 
    ggplot() + 
    aes(x = neighbourhood, y= n) + 
    geom_bar(stat = 'identity')

Let's Clean This Up a Bit

summary_table %>% 
    filter(property_type == 'Apartment') %>% 
    ggplot() + 
    aes(x = reorder(neighbourhood, n), y=n) + 
    coord_flip() + 
    geom_bar(stat = 'identity')

Comparisons: Fill, Color, and Facets

From Exercise 7

summary_table %>% 
    ggplot() + 
    aes(x = reorder(neighbourhood, n), y=n, fill = property_type) + 
    coord_flip() + 
    geom_bar(stat = 'identity') 

From Our First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating, color = property_type) + 
    geom_point(alpha = .5) + 
    theme_bw() + 
    labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality') 

From Our First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating, color = property_type) + 
    geom_point(alpha = .5) + 
    theme_bw() + 
    facet_wrap(~property_type) + 
    labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality') 

Optional: Score Types

listings %>% 
    select(number_of_reviews, contains("review_scores"), - review_scores_rating) %>% 
    gather(key = type, value = score, -number_of_reviews) %>% 
    ggplot() + 
    aes(x = factor(score), y = number_of_reviews) + 
    geom_boxplot() + 
    facet_wrap(~type)

Mini-Project

Project Description

We are going to make a simple business intelligence (BI) dashboard for AirBnB, using the wrangling and visualization skills we have developed in this session. You will use this dashboard to lead a meeting with decision-makers on where to prioritize host recruitment efforts.

Instructions

  1. Please open wrangle_viz/dashboard.Rmd
  2. Click the knit button at the top of RStudio and observe the result. If you see a dashboard, then are good to go.
  3. Modify the dashboard:
    • Include your names in the author metadata up top
    • Write code for data preparation and visualizations as appropriate. You should aim to include a main visualization and a supporting one.
    • Include all your code in the R "code chunks" that begin with ```{r}
    • Add commentary in the indicated area
  4. Once you are done, knit the dashboard one last time and place it in the shared Dropbox folder.

You are done iff

  1. You have added your names to the dashboard
  2. You have replaced "Box 1", "Box 2", and "Box 3" with informative subtitles
  3. You have created two original visualizations aimed at supporting a clear message about where AirBnB should recruit hosts.
  4. You have written brief commentary clarifying your message.
  5. You have submitted your dashboard to the shared Dropbox.

Additional Resources

Visualization Problems are Wrangling Problems

"In my experience, the vast majority of graphing agony is due to insufficient data wrangling." - Jenny Bryan

Additional Resources

Map of the Tidyverse